
    Reinforcement Learning for the Unit Commitment Problem

    In this work we solve the day-ahead unit commitment (UC) problem by formulating it as a Markov decision process (MDP) and finding a low-cost policy for generation scheduling. We present two reinforcement learning algorithms and devise a third one. We compare our results to previous work that uses simulated annealing (SA), and show a 27% improvement in operation costs, with a running time of 2.5 minutes (compared to 2.5 hours for the existing state of the art). Comment: Accepted and presented at IEEE PES PowerTech, Eindhoven 2015, paper ID 46273.
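
    A minimal sketch of the kind of MDP formulation described above, under illustrative assumptions: states encode the hour and the current on/off status of each unit, actions are commitment vectors for the next hour, and the reward is the negative of fuel, start-up, and load-shedding costs. The two-unit system, the cost figures, and the tabular Q-learning loop are placeholders, not the paper's algorithms.

    import itertools
    import random
    from collections import defaultdict

    N_UNITS = 2
    HOURS = 24
    CAPACITY = [60.0, 40.0]        # MW per unit (assumed)
    FUEL_COST = [20.0, 35.0]       # $/MWh (assumed)
    STARTUP_COST = [500.0, 100.0]  # $ per start (assumed)
    DEMAND = [50 + 30 * (8 <= h < 20) for h in range(HOURS)]  # toy demand profile
    PENALTY = 1e4                  # $ per MW of unserved load (assumed)

    ACTIONS = list(itertools.product([0, 1], repeat=N_UNITS))  # all on/off commitment vectors

    def step(hour, status, action):
        """Return (next_state, reward) for committing `action` during `hour`."""
        startup = sum(STARTUP_COST[i] for i in range(N_UNITS) if action[i] and not status[i])
        avail = sum(CAPACITY[i] for i in range(N_UNITS) if action[i])
        remaining = min(avail, DEMAND[hour])
        cost = startup + PENALTY * max(0.0, DEMAND[hour] - avail)
        for i in sorted(range(N_UNITS), key=lambda i: FUEL_COST[i]):  # merit-order dispatch
            if action[i]:
                g = min(CAPACITY[i], remaining)
                cost += FUEL_COST[i] * g
                remaining -= g
        return (hour + 1, action), -cost

    Q = defaultdict(float)
    lr, gamma, eps = 0.1, 1.0, 0.2
    for episode in range(2000):
        state = (0, (0,) * N_UNITS)            # (hour, unit statuses)
        while state[0] < HOURS:
            a = random.choice(ACTIONS) if random.random() < eps else \
                max(ACTIONS, key=lambda a: Q[(state, a)])
            nxt, r = step(state[0], state[1], a)
            best_next = 0.0 if nxt[0] == HOURS else max(Q[(nxt, b)] for b in ACTIONS)
            Q[(state, a)] += lr * (r + gamma * best_next - Q[(state, a)])
            state = nxt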

    Chance-Constrained Outage Scheduling using a Machine Learning Proxy

    Outage scheduling aims at defining, over a horizon of several months to years, when different components needing maintenance should be taken out of operation. Its objective is to minimize the expected operating cost while satisfying reliability-related constraints. We propose a distributed, scenario-based, chance-constrained optimization formulation for this problem. To tackle tractability issues arising in large networks, we use machine learning to build a proxy for predicting outcomes of power system operation processes in this context. On the IEEE-RTS79 and IEEE-RTS96 networks, our solution obtains cheaper and more reliable plans than other candidates.
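
    A minimal sketch of the proxy idea, under stated assumptions: an expensive operation-process simulator labels (outage plan, scenario) pairs offline, a classifier is trained as a proxy, and candidate plans are then screened against an empirical chance constraint using only the proxy. The simulator stand-in, the feature encoding, the random-forest choice, and the 95% reliability target are illustrative, not the paper's pipeline.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    n_components, n_scenarios = 10, 8

    def simulate_operation(plan, scenario):
        """Stand-in for the expensive power-system operation process.
        Returns 1 if operation stays within limits, 0 otherwise (toy rule)."""
        stress = plan.sum() * 0.1 + scenario.sum() * 0.05 + rng.normal(0, 0.1)
        return int(stress < 0.55)

    # Offline phase: label a training set by running the expensive process.
    plans = rng.integers(0, 2, size=(500, n_components))   # which components are taken out
    scenarios = rng.normal(size=(500, n_scenarios))        # load / renewable scenarios
    X = np.hstack([plans, scenarios])
    y = np.array([simulate_operation(p, s) for p, s in zip(plans, scenarios)])
    proxy = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    # Online phase: screen a candidate plan against sampled scenarios with the proxy only.
    def satisfies_chance_constraint(plan, sampled_scenarios, target=0.95):
        feats = np.hstack([np.tile(plan, (len(sampled_scenarios), 1)), sampled_scenarios])
        ok_prob = proxy.predict_proba(feats)[:, 1]   # P(operation within limits) per scenario
        return (ok_prob > 0.5).mean() >= target      # empirical chance constraint

    candidate = rng.integers(0, 2, size=n_components)
    print(satisfies_chance_constraint(candidate, rng.normal(size=(200, n_scenarios))))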

    Beyond the One Step Greedy Approach in Reinforcement Learning

    The famous Policy Iteration algorithm alternates between policy improvement and policy evaluation. Implementations of this algorithm with several variants of the latter evaluation stage, e.g., n-step and trace-based returns, have been analyzed in previous works. However, the case of multiple-step lookahead policy improvement, despite the recent increase in empirical evidence of its strength, has to our knowledge not been carefully analyzed yet. In this work, we introduce the first such analysis. Namely, we formulate variants of multiple-step policy improvement, derive new algorithms using these definitions, and prove their convergence. Moreover, we show that recent prominent Reinforcement Learning algorithms are, in fact, instances of our framework. We thus shed light on their empirical success and give a recipe for deriving new algorithms for future study. Comment: ICML 2018.
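
    A minimal sketch, under illustrative assumptions, of the h-step greedy improvement step analyzed above: on a small known MDP, the improved policy takes the first action of the best h-horizon plan, bootstrapping with the current value estimate V at the horizon (h = 1 recovers the usual greedy step of Policy Iteration). The random MDP and the choice h = 3 are placeholders.

    import numpy as np

    rng = np.random.default_rng(1)
    S, A, gamma = 5, 3, 0.9
    P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
    R = rng.uniform(size=(S, A))                 # immediate rewards

    def h_step_value(s, V, h):
        """Optimal h-horizon return from state s, bootstrapping with V at the horizon."""
        if h == 0:
            return V[s]
        return max(R[s, a] + gamma * sum(P[s, a, t] * h_step_value(t, V, h - 1) for t in range(S))
                   for a in range(A))

    def h_step_greedy(V, h):
        """First action of the best h-step plan from each state."""
        Q = [[R[s, a] + gamma * sum(P[s, a, t] * h_step_value(t, V, h - 1) for t in range(S))
              for a in range(A)] for s in range(S)]
        return np.argmax(Q, axis=1)

    def evaluate(policy):
        """Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi."""
        P_pi = np.array([P[s, policy[s]] for s in range(S)])
        R_pi = np.array([R[s, policy[s]] for s in range(S)])
        return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

    # Policy Iteration with an h-step greedy improvement stage (h = 3 here).
    V = np.zeros(S)
    for _ in range(10):
        policy = h_step_greedy(V, h=3)
        V = evaluate(policy)
    print(policy, V)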

    Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning

    Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control. In a recent work \cite{efroni2018beyond}, multiple-step greedy policies and their use in vanilla Policy Iteration algorithms were proposed and analyzed. In this work, we study multiple-step greedy algorithms in more practical setups. We begin by highlighting a counter-intuitive difficulty arising with soft-policy updates: even in the absence of approximations, and contrary to the 1-step-greedy case, monotonic policy improvement is not guaranteed unless the update stepsize is sufficiently large. Taking particular care about this difficulty, we formulate and analyze online and approximate algorithms that use such a multi-step greedy operator. Comment: NIPS 2018.
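
    A minimal sketch, on an assumed toy MDP, of the soft-policy update discussed above: the new policy mixes a 2-step greedy policy into the old one, pi_new = (1 - step) * pi_old + step * pi_greedy, and the evaluated values can be compared across stepsizes to check whether the improvement is monotone. The MDP, the 2-step lookahead choice, and the stepsizes tried are illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    S, A, gamma = 4, 2, 0.9
    P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a] over next states
    R = rng.uniform(size=(S, A))                 # immediate rewards

    def evaluate(pi):
        """Exact value of a stochastic policy pi (shape S x A)."""
        P_pi = np.einsum('sa,sat->st', pi, P)
        R_pi = np.einsum('sa,sa->s', pi, R)
        return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

    def two_step_greedy(V):
        """Deterministic 2-step greedy policy w.r.t. V, returned as a one-hot matrix."""
        Q1 = R + gamma * P @ V          # 1-step lookahead Q-values
        V1 = Q1.max(axis=1)             # optimal values after one step of lookahead
        Q2 = R + gamma * P @ V1         # 2-step lookahead Q-values
        pi = np.zeros((S, A))
        pi[np.arange(S), Q2.argmax(axis=1)] = 1.0
        return pi

    pi_old = np.full((S, A), 1.0 / A)   # start from the uniform policy
    V_old = evaluate(pi_old)
    pi_greedy = two_step_greedy(V_old)
    for step in (0.1, 0.5, 1.0):        # small vs. large update stepsizes
        pi_new = (1 - step) * pi_old + step * pi_greedy
        print(step, np.all(evaluate(pi_new) >= V_old - 1e-12))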

    Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs

    As communication protocols evolve, datacenter network utilization increases. As a result, congestion is more frequent, causing higher latency and packet loss. Combined with the increasing complexity of workloads, manual design of congestion control (CC) algorithms becomes extremely difficult. This calls for the development of AI approaches to replace the human effort. Unfortunately, it is currently not possible to deploy AI models on network devices due to their limited computational capabilities. Here, we address this problem by building a computationally light solution based on a recent reinforcement learning CC algorithm [arXiv:2207.02295]. We reduce the inference time of RL-CC by a factor of 500 by distilling its complex neural network into decision trees. This transformation enables real-time inference within the microsecond decision-time requirement, with a negligible effect on quality. We deploy the transformed policy on NVIDIA NICs in a live cluster. Compared to popular CC algorithms used in production, RL-CC is the only method that performs well on all benchmarks tested, over a wide range of numbers of flows. It balances multiple metrics simultaneously: bandwidth, latency, and packet drops. These results suggest that data-driven methods for CC are feasible, challenging the prior belief that handcrafted heuristics are necessary to achieve optimal performance.
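
    A minimal sketch of the distillation step, with an assumed stand-in teacher: observations are mapped by a teacher policy (here a hand-written function; in the work above it would be the trained RL-CC network) to rate adjustments, a dataset of teacher decisions is collected, and a shallow decision tree is fit to imitate it, after which fidelity and batch-inference latency are measured. The observation features, tree depth, and dataset sizes are illustrative, not NVIDIA's deployment path.

    import time
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(3)

    def teacher_policy(obs):
        """Stand-in teacher mapping CC observations (e.g. RTT, rate, congestion signal)
        to a multiplicative rate adjustment; the real teacher would be the RL-CC network."""
        rtt, rate, congestion = obs[..., 0], obs[..., 1], obs[..., 2]
        return np.clip(1.0 - 0.5 * congestion + 0.1 / (1.0 + rtt) - 0.01 * rate, 0.5, 1.5)

    # Distillation dataset: query the teacher on sampled observations.
    X = rng.uniform(0.0, 1.0, size=(50_000, 3))
    y = teacher_policy(X)

    student = DecisionTreeRegressor(max_depth=8).fit(X, y)   # small, fixed-latency model

    # Fidelity and batch-inference latency of the distilled student.
    X_test = rng.uniform(0.0, 1.0, size=(10_000, 3))
    err = np.abs(student.predict(X_test) - teacher_policy(X_test)).mean()
    t0 = time.perf_counter()
    student.predict(X_test)
    print(f"mean abs error {err:.4f}, batch inference {time.perf_counter() - t0:.4f}s")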

    A Tale of Two-Timescale Reinforcement Learning with the Tightest Finite-Time Bound

    Policy evaluation in reinforcement learning is often conducted using two-timescale stochastic approximation, which results in various gradient temporal difference methods such as GTD(0), GTD2, and TDC. Here, we provide convergence rate bounds for this suite of algorithms. Algorithms such as these have two iterates, $\theta_n$ and $w_n$, which are updated using two distinct stepsize sequences, $\alpha_n$ and $\beta_n$, respectively. Assuming $\alpha_n = n^{-\alpha}$ and $\beta_n = n^{-\beta}$ with $1 > \alpha > \beta > 0$, we show that, with high probability, the two iterates converge to their respective solutions $\theta^*$ and $w^*$ at rates given by $\|\theta_n - \theta^*\| = \tilde{O}(n^{-\alpha/2})$ and $\|w_n - w^*\| = \tilde{O}(n^{-\beta/2})$; here, $\tilde{O}$ hides logarithmic terms. Via comparable lower bounds, we show that these bounds are, in fact, tight. To the best of our knowledge, ours is the first finite-time analysis which achieves these rates. While it was known that the two timescale components decouple asymptotically, our results depict this phenomenon more explicitly by showing that it in fact happens from some finite time onwards. Lastly, compared to existing works, our result applies to a broader family of stepsizes, including non-square-summable ones.
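
    A minimal sketch, under assumed toy dynamics, of the two-timescale updates the bounds above refer to: the TDC iterates $\theta_n$ (slow, stepsize $\alpha_n = n^{-\alpha}$) and $w_n$ (fast, stepsize $\beta_n = n^{-\beta}$, with $1 > \alpha > \beta > 0$) are run on a small chain-like Markov reward process with linear features. The process, the features, and the exponents 0.75 and 0.5 are illustrative.

    import numpy as np

    rng = np.random.default_rng(4)
    S, d, gamma = 5, 3, 0.95
    phi = rng.normal(size=(S, d))              # fixed linear features per state
    alpha_exp, beta_exp = 0.75, 0.5            # stepsize exponents, 1 > alpha > beta > 0

    theta = np.zeros(d)                        # slow iterate: value-function weights
    w = np.zeros(d)                            # fast iterate: auxiliary correction weights
    s = 0
    for n in range(1, 100_000):
        s_next = (s + rng.integers(1, 3)) % S  # toy transition kernel
        r = float(s_next == 0)                 # reward when the chain returns to state 0
        f, f_next = phi[s], phi[s_next]
        delta = r + gamma * theta @ f_next - theta @ f             # TD error
        alpha_n, beta_n = n ** -alpha_exp, n ** -beta_exp          # two-timescale stepsizes
        theta += alpha_n * (delta * f - gamma * (w @ f) * f_next)  # TDC update for theta
        w += beta_n * (delta - w @ f) * f                          # faster update for w
        s = s_next
    print(theta, w)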